    A model based approach to Spotify data analysis: a Beta GLMM

    Digital music distribution is increasingly powered by automated mechanisms that continuously capture, sort and analyze large amounts of Web-based data. This paper deals with the management of songs' audio features from a statistical point of view. In particular, it explores the data-retrieval mechanisms enabled by the Spotify Web API and suggests statistical tools for the analysis of these data. Special attention is devoted to song popularity, and a Beta model including random effects is proposed in order to give a first answer to questions such as: what are the determinants of popularity? Identifying a model able to describe this relationship, and determining which characteristics matter most in making a song popular, is a very interesting topic for those who aim to predict the success of new products.
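
    The abstract's Beta GLMM is not reproducible from the text alone, but a minimal sketch can show why a Beta distribution suits popularity scores: Spotify reports popularity on a 0-100 scale, which rescales naturally to the open interval (0, 1). The snippet below fits a plain Beta distribution by the method of moments; the popularity values are invented for illustration.

```python
# Minimal sketch (not the paper's Beta GLMM): fit a Beta distribution
# to popularity scores rescaled from [0, 100] into (0, 1).
# The popularity values below are made-up illustrative data.

def beta_method_of_moments(values):
    """Estimate Beta(alpha, beta) parameters from data in (0, 1)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    # Valid whenever var < mean * (1 - mean), which holds for Beta data.
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

# Rescale integer popularity scores, nudging away from the boundaries
# so every observation lies strictly inside (0, 1).
raw = [35, 52, 60, 71, 48, 55, 80, 42, 66, 58]
eps = 0.5
scaled = [(p + eps) / (100 + 2 * eps) for p in raw]

alpha, beta = beta_method_of_moments(scaled)
print(round(alpha, 2), round(beta, 2))
```

    A full analysis in the spirit of the paper would replace this marginal fit with a Beta regression whose mean depends on audio features, plus random effects; the moment fit only illustrates the distributional choice.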

    Clustering alternatives in preference-approvals via novel pseudometrics

    Preference-approval structures combine preference rankings and approval voting for declaring opinions over a set of alternatives. In this paper, we propose a new procedure for clustering alternatives in order to reduce the complexity of the preference-approval space and provide a more accessible interpretation of the data. To that end, we present a new family of pseudometrics on the set of alternatives that take into account voters' preferences via preference-approvals. To obtain clusters, we use the Ranked k-medoids (RKM) partitioning algorithm, which takes as input the similarities between pairs of alternatives based on the proposed pseudometrics. Finally, using non-metric multidimensional scaling, the clusters are represented in 2-dimensional space.
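
    The pipeline can be sketched with a toy pseudometric: each voter contributes a normalised rank gap plus an approval disagreement between two alternatives, and clusters are then found on the resulting distance matrix. This is illustrative only; the paper's pseudometric family and the Ranked k-medoids algorithm are more elaborate, and the voter data below are invented.

```python
# Toy pseudometric between alternatives from preference-approvals,
# followed by brute-force 2-medoids on the distance matrix.
# Not the paper's method; all data invented for illustration.
from itertools import combinations

# Each voter: (ranking of alternatives, best first; set of approved ones).
voters = [
    (["a", "b", "c", "d"], {"a", "b"}),
    (["b", "a", "c", "d"], {"b"}),
    (["d", "c", "a", "b"], {"d", "c"}),
    (["c", "d", "b", "a"], {"c"}),
]
alts = ["a", "b", "c", "d"]

def dist(x, y):
    """Average over voters of normalised rank gap plus approval disagreement."""
    total = 0.0
    for ranking, approved in voters:
        rank_gap = abs(ranking.index(x) - ranking.index(y)) / (len(alts) - 1)
        appr_gap = 1.0 if (x in approved) != (y in approved) else 0.0
        total += (rank_gap + appr_gap) / 2
    return total / len(voters)

D = {(x, y): dist(x, y) for x in alts for y in alts}

def kmedoids_cost(medoids):
    return sum(min(D[(x, m)] for m in medoids) for x in alts)

# Exhaustive search over medoid pairs is feasible for tiny instances.
best = min(combinations(alts, 2), key=kmedoids_cost)
clusters = {m: [x for x in alts
                if min(best, key=lambda mm: D[(x, mm)]) == m] for m in best}
print(best, clusters)
```

    Symmetry and zero self-distance make this a genuine pseudometric; triangle-inequality checks would be needed for a formal claim.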

    The Neutrophil-to-Lymphocyte Ratio is Related to Disease Activity in Relapsing Remitting Multiple Sclerosis

    Background: The role of the neutrophil-to-lymphocyte ratio (NLR) of peripheral blood has been investigated in relation to several autoimmune diseases. Few studies have addressed the significance of the NLR as a marker of disease activity in multiple sclerosis (MS). Methods: This is a retrospective study of relapsing–remitting MS (RRMS) patients admitted to the tertiary MS center of Catania, Italy, during the period from 1 January to 31 December 2018. The aim of the present study was to investigate the significance of the NLR in reflecting disease activity in a cohort of newly diagnosed RRMS patients. Results: Of a total sample of 132 patients diagnosed with RRMS, 84 were enrolled in the present study. In the association analysis, a relation between the NLR value and disease activity at onset was found (Cramér's V = 0.271, p = 0.013). In the logistic regression model, the NLR (p = 0.03, Exp(B) = 3.5, 95% CI 1.089–11.4) was related to disease activity at onset. Conclusion: An elevated NLR is associated with disease activity at onset in RRMS patients. Larger studies with longer follow-up are needed.
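
    The quantities involved are simple to compute: the NLR is the ratio of the two cell counts, and a crude association with disease activity can be read off a 2x2 table. The sketch below uses invented counts and an invented 2.0 cut-off, not the study's data or threshold.

```python
# Hedged illustration (not the study data): computing the NLR and a
# crude odds ratio relating an elevated NLR to disease activity at
# onset. Counts and the 2.0 cut-off are invented for the example.

def nlr(neutrophils, lymphocytes):
    return neutrophils / lymphocytes

# (neutrophils x10^9/L, lymphocytes x10^9/L, active disease at onset)
patients = [
    (5.2, 1.8, True), (4.1, 2.3, False), (6.0, 1.5, True),
    (3.8, 2.5, False), (3.6, 2.0, True), (4.6, 2.0, False),
    (6.2, 1.4, True), (3.9, 2.6, False),
]

cutoff = 2.0
table = {(hi, act): 0 for hi in (True, False) for act in (True, False)}
for n, l, active in patients:
    table[(nlr(n, l) > cutoff, active)] += 1

a, b = table[(True, True)], table[(True, False)]    # elevated NLR
c, d = table[(False, True)], table[(False, False)]  # normal NLR
odds_ratio = (a * d) / (b * c) if b * c else float("inf")
print(table, odds_ratio)
```

    The study's reported effect (Exp(B) = 3.5) comes from a logistic regression, not a raw 2x2 odds ratio, so this sketch only illustrates the direction of the association.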

    Variable selection in mixed models: a graphical approach

    Model selection can be defined as the task of estimating the performance of different models in order to choose the (approximately) best one. The purpose of this article is to introduce an extension of the graphical representation of deviance, proposed in the framework of classical and generalized linear models, to the wider class of mixed models. The proposed plot is useful in determining which explanatory variables are important, conditioning on the random-effects part. The applicability and easy interpretation of the graph are illustrated with a real data example.

    Random forest analysis: a new approach for classification of Beta Thalassemia

    In recent years, Thalassemia care providers have started classifying patients as transfusion-dependent Thalassemia (TDT) or non-transfusion-dependent Thalassemia (NTDT), owing to the established role of transfusion therapy in defining the clinical complication profile, although this classification was also based on expert opinion and is limited by its reliance on patients' current transfusion status. Starting from a vast set of variables indicating severity of phenotype, and through the use of both classification and clustering techniques, we explore the presence of two (TDT vs NTDT) or more clusters, in order to approach a new definition for the classification of Beta-Thalassemia within the Thalassemia Syndromes (TS).
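
    A heavily simplified stand-in for a random forest can convey the idea of ensemble classification into TDT vs NTDT: bagged decision stumps, each fit on a bootstrap sample and a randomly chosen feature, voting on the label. A real analysis would use a full implementation (e.g. scikit-learn); the feature values, labels and variable names below are all invented.

```python
# Illustrative bagged-stump ensemble (a toy random-forest flavour),
# not the paper's analysis. All data below are invented.
import random

random.seed(1)

# rows: (features, label), features = [haemoglobin, log-ferritin]
data = [
    ([7.2, 3.9], "TDT"), ([7.5, 3.7], "TDT"), ([6.9, 4.1], "TDT"),
    ([9.8, 2.8], "NTDT"), ([10.1, 2.6], "NTDT"), ([9.5, 3.0], "NTDT"),
]

def fit_stump(sample, feature):
    """Threshold at the midpoint between the two class means."""
    means = {}
    for label in ("TDT", "NTDT"):
        vals = [x[feature] for x, y in sample if y == label]
        means[label] = sum(vals) / len(vals)
    thr = (means["TDT"] + means["NTDT"]) / 2
    low_label = "TDT" if means["TDT"] < means["NTDT"] else "NTDT"
    return feature, thr, low_label

def predict(stumps, x):
    votes = []
    for feature, thr, low_label in stumps:
        other = "NTDT" if low_label == "TDT" else "TDT"
        votes.append(low_label if x[feature] < thr else other)
    return max(set(votes), key=votes.count)  # majority vote

stumps = []
for _ in range(25):
    sample = [random.choice(data) for _ in data]  # bootstrap sample
    while len({y for _, y in sample}) < 2:        # ensure both classes present
        sample = [random.choice(data) for _ in data]
    stumps.append(fit_stump(sample, random.randrange(2)))

print(predict(stumps, [7.0, 4.0]), predict(stumps, [10.0, 2.7]))
```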

    Weighted and unweighted distances based decision tree for ranking data

    Preference data represent a particular type of ranking data (widely used in sports, web search and the social sciences), where a group of people gives their preferences over a set of alternatives. Within this framework, distance-based decision trees represent a non-parametric tool for identifying the profiles of subjects giving a similar ranking. This paper aims at detecting, in the framework of (complete and incomplete) ranking data, the impact of differently structured weighted distances for building decision trees. The traditional metrics between rankings do not take into account the importance of swapping elements that are similar to each other (element weights) or elements belonging to the top (or to the bottom) of an ordering (position weights). By means of simulations, using weighted distances to build decision trees, we compute the impact of different weighting structures both on splitting and on consensus ranking. The distances used satisfy Kemeny's axioms and, accordingly, a modified version of the rank correlation coefficient τx proposed by Emond and Mason is introduced and used for assessing the trees' goodness.
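
    The idea of position weights can be sketched directly: in a Kemeny-type distance, each discordant pair of items contributes a weight that depends on where the pair sits, so swaps near the top cost more than swaps near the bottom. The weighting scheme below (weight of the higher of the two positions) is one plausible choice, not necessarily the one used in the paper.

```python
# Sketch of a position-weighted Kemeny-type distance between two
# complete rankings. One illustrative weighting scheme only.
from itertools import combinations

def weighted_kemeny(r1, r2, weights):
    """r1, r2: lists of items (best first); weights: one per position."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    d = 0.0
    for x, y in combinations(r1, 2):
        # Discordant pair: the two rankings order x and y differently.
        if (pos1[x] - pos1[y]) * (pos2[x] - pos2[y]) < 0:
            d += weights[min(pos1[x], pos1[y])]
    return d

r1 = ["a", "b", "c", "d"]
r2 = ["b", "a", "d", "c"]
top_heavy = [4, 3, 2, 1]  # swaps among top items weigh more
uniform = [1, 1, 1, 1]    # reduces to the ordinary discordant-pair count

print(weighted_kemeny(r1, r2, uniform))
print(weighted_kemeny(r1, r2, top_heavy))
```

    With uniform weights the function counts discordant pairs (here the swaps a/b and c/d give 2); the top-heavy weights charge the a/b swap more than the c/d swap.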

    GAMLSS for high-variability data: an application to a liver fibrosis case.

    In this paper, we propose to manage the problem caused by overdispersed data by applying the generalized additive model for location, scale and shape (GAMLSS) framework, as introduced by Rigby and Stasinopoulos (2005). The idea of using a GAMLSS approach for handling this problem comes from Aitkin (1996), who used an EM maximum likelihood estimation algorithm (Dempster, Laird, and Rubin, 1977) to deal with overdispersed generalized linear models (GLMs). As in the GLM case, the algorithm is initially derived as a form of Gaussian quadrature assuming a normal mixing distribution. The GAMLSS specification allows the extension of the Aitkin algorithm to probability distributions not belonging to the exponential family. In particular, the aim of this work is to show the importance of using a GAMLSS structure when a mixture is used to provide a natural representation of heterogeneity in a finite number of latent classes (Celeux and Diebolt, 1992).
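
    The core of the latent-class idea is EM for a finite mixture. The pure-Python sketch below fits a two-component Gaussian mixture by EM on simulated data; it illustrates representing heterogeneity via latent classes in the spirit of Aitkin's approach, without any of the GAMLSS machinery.

```python
# Minimal EM for a two-component Gaussian mixture (illustration of
# finite-mixture heterogeneity, not the paper's GAMLSS method).
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def em_two_normals(data, iters=200):
    # Crude initialisation: components anchored at the data extremes.
    pi, mu1, mu2, s1, s2 = 0.5, min(data), max(data), 1.0, 1.0
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point.
        resp = []
        for x in data:
            p1 = pi * normal_pdf(x, mu1, s1)
            p2 = (1 - pi) * normal_pdf(x, mu2, s2)
            resp.append(p1 / (p1 + p2))
        # M-step: weighted means, standard deviations, mixing proportion.
        n1 = sum(resp)
        n2 = len(data) - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / n2
        s1 = max(1e-3, math.sqrt(sum(r * (x - mu1) ** 2
                                     for r, x in zip(resp, data)) / n1))
        s2 = max(1e-3, math.sqrt(sum((1 - r) * (x - mu2) ** 2
                                     for r, x in zip(resp, data)) / n2))
        pi = n1 / len(data)
    return pi, mu1, mu2

random.seed(0)
data = ([random.gauss(0, 1) for _ in range(150)]
        + [random.gauss(5, 1) for _ in range(150)])
pi, mu1, mu2 = em_two_normals(data)
print(round(pi, 2), round(mu1, 2), round(mu2, 2))
```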

    Classification trees for preference data: a distance-based approach

    In the framework of preference rankings, when the interest lies in explaining which predictors, and which interactions among predictors, are able to explain the observed preference structures, the possibility of deriving consensus measures using a classification tree represents a novelty and an important tool, given its easy interpretability. In this work we propose the use of a multivariate decision tree in which a weighted Kemeny distance is used both to evaluate the distances between rankings and to define an impurity measure to be used in the recursive partitioning. The proposed approach also allows distances involving the top and the bottom alternatives of the rankings to be weighted differently.

    Dealing with the Pseudo-Replication Problem in Longitudinal Data from Posidonia Oceanica Surveys: Modeling Dependence vs. Subsampling

    Posidonia oceanica is the key species of the most important ecosystem in subtidal habitats of the Mediterranean Sea. Being sensitive to changes in the environment, it is considered a crucial indicator of the quality of coastal marine waters. A peculiarity of P. oceanica is the presence of reiterative modules characterizing its growth, which lend themselves to back-dating techniques, allowing for the reconstruction of the past history of growth variables (annual rhizome elongation and diameter, primary production, etc.). Such back-dating techniques provide, for each sampled shoot, a longitudinal series of multivariate data; this is an instance of what Hurlbert (1984), in a seminal paper, defined as "pseudo-replications", for which it becomes crucial to take into account the possible dependence of the data. A common solution to pseudo-replication in the ecological literature is sub-sampling: given repeated measurements on the same unit, only a random sub-sample of those measurements is analyzed, in order to attenuate correlation and obtain approximately independently distributed observations, to which standard statistical methods can be applied. In its most extreme version, only one measurement is randomly drawn for each unit, i.e. the sub-sampling size is one. If on one hand sub-sampling attenuates correlation, on the other it implies a loss of information (due to the reduction of the total sample size) and therefore requires a higher number of sampling units to ensure a specified level of efficiency and power. In this talk, we contrast sub-sampling with the alternative approach of handling the dependence directly at the modelling stage, using the class of Generalized Linear Mixed Models. We show that this approach permits remarkable gains in precision of estimation and power in testing, without requiring the increase in sample size involved in sub-sampling, and thus avoids the practice of over-sampling, which has a negative impact on aquatic ecosystems.
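
    The extreme sub-sampling scheme described above is easy to make concrete: keep one randomly drawn yearly measurement per shoot, discarding the rest of each longitudinal series. The snippet below sketches this step with invented data; the GLMM alternative, which keeps all measurements and models the within-shoot dependence, is not implemented here.

```python
# Toy sketch of sub-sampling with size one: a single randomly drawn
# measurement per shoot. Shoot IDs and values are invented.
import random

random.seed(42)

# shoot_id -> longitudinal series of annual rhizome elongation (mm)
shoots = {
    "s1": [7.1, 6.8, 7.4, 7.0],
    "s2": [5.9, 6.2, 6.0],
    "s3": [8.3, 8.0, 8.5, 8.1, 8.2],
}

# One measurement retained per shoot; the rest are discarded.
subsample = {sid: random.choice(series) for sid, series in shoots.items()}

n_all = sum(len(series) for series in shoots.values())
n_sub = len(subsample)
print(n_all, n_sub, subsample)
```

    The drop from n_all to n_sub observations is exactly the information loss the abstract refers to: the retained data are nearly independent, but power must be bought back by sampling more shoots.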